This is an exploration of the tidy data set wineQualityReds.csv, provided by the Udacity Data Analyst Nanodegree for Project 3. The data set was chosen for its small number of observations, which keeps the execution time of certain plots manageable. Guiding question: “Which variables affect red wine quality?”
Initial Explorations:
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Some parameter names may be a bit long to plot with ggpairs, so the parameters may need to be renamed later. There are 1599 observations of red wine in the data set (rather small compared with other data sets). Alcohol content ranges from 8.40% to 14.90%, with a median of 10.20%. Quality ratings are mostly 5 or 6, with a median of 6. pH is stable: it ranges from 2.74 to 4.01, and half of the wines fall between 3.21 and 3.40.
It seems that fixed acidity has little to do with the quality of red wines, and that there are far more wines of quality 5 to 7 than of any other quality level; this is also noted in the readme wineQualityInfo.txt. Perhaps it would be better to combine the bottom two levels and the top two levels.
The histogram grid took some research to create (a first attempt at a function did not work as intended). It provides a good overview of the distributions of the chemical attributes. residual.sugar, chlorides and perhaps sulphates are long-tailed and could use a closer look with a rescaled x-axis. As an extra note, density seems to be roughly normally distributed.
From the scatter plots, sulphates and alcohol content seem to have some correlation with quality: higher alcohol content seems to indicate higher quality, but the variance is still pretty high.
The boxplots seem to reveal some interesting trends where the other plots showed relatively little. A re-scaling of the Y-axis for residual.sugar, chlorides, and sulphates might reveal a better view of the boxplots.
##
## Pearson's product-moment correlation
##
## data: quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and pH
## t = -37.3659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
A sanity check of fixed acidity against pH shows, as expected, a reasonably strong negative correlation: -0.68. Looking at the correlation coefficients, alcohol, volatile acidity, sulphates, and citric acid have the four highest correlations with quality. pH, residual sugar and free sulfur dioxide are seemingly uncorrelated with quality.
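A ranking like the one above can be produced by correlating each column with quality and sorting by absolute value. The sketch below is only an illustration: it uses the first ten rows printed by str() earlier as a tiny stand-in data frame (wqr_head is a hypothetical name), so the numbers it prints are not the full-data correlations.

```r
# First ten rows of four columns, copied from the str() output above.
wqr_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of each predictor with quality.
predictors <- setdiff(names(wqr_head), "quality")
cors <- sapply(wqr_head[predictors], function(col) cor(col, wqr_head$quality))

# Order by strength of association, ignoring the sign.
cors_ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_ranked, 3))
```

On the full data frame the same lines would run over all eleven chemical columns.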
Citric Acid
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] "(0.001,0.15]" "(0.15,0.3]" "(0.3,0.5]" "(0.5,1.2]"
##
## Pearson's product-moment correlation
##
## data: quality and as.numeric(cAcid.cut)
## t = 9.2631, df = 1465, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1862839 0.2829939
## sample estimates:
## cor
## 0.235221
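Created the categorical variable cAcid.cut from citric.acid by looking at the histogram (which is relatively evenly distributed) and then choosing breaks that maximize the correlation with quality. Because the lowest break is 0.001, wines with a citric.acid of exactly 0 fall outside every bin and become NA, which is why the test above has df = 1465 (1467 of the 1599 wines). I feel that more categorical variables could help simplify and visualize how each attribute may be contributing to quality. The correlation with quality also increases with the new categorical variable.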
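To see how cut() treats values below the lowest break, here is a small sketch using the first ten citric.acid values from the str() output. The breaks are inferred from the printed levels; the second call, which starts at 0 and closes the lowest interval, is my assumption about one way to keep the zero values, not what was run above.

```r
# First ten citric.acid values, copied from the str() output above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Breaks inferred from the printed levels. A value of exactly 0 lies
# outside (0.001, 0.15], so cut() returns NA for it.
breaks_printed <- c(0.001, 0.15, 0.3, 0.5, 1.2)
cut_printed <- cut(citric, breaks = breaks_printed)
n_missing <- sum(is.na(cut_printed))
print(n_missing)

# Alternative (assumption): start at 0 and close the lowest interval,
# so zero values land in the first bin instead of being dropped.
cut_closed <- cut(citric, breaks = c(0, 0.15, 0.3, 0.5, 1.2),
                  include.lowest = TRUE)
print(table(cut_closed, useNA = "ifany"))
```

Keeping the zeros would also keep all 1599 rows available to any model that uses the binned variable.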
Free and Total Sulfur Dioxide
## [1] "(0,59]" "(59,109]" "(109,300]"
##
## Pearson's product-moment correlation
##
## data: quality and as.numeric(total.sulf.dioxide.cut)
## t = -8.6666, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2582798 -0.1646315
## sample estimates:
## cor
## -0.2119421
## X fixed.acidity volatile.acidity citric.acid
## Min. : 15.0 Min. : 5.900 Min. :0.1900 Min. :0.0000
## 1st Qu.: 447.0 1st Qu.: 6.600 1st Qu.:0.3775 1st Qu.:0.0450
## Median :1057.5 Median : 7.350 Median :0.6000 Median :0.2000
## Mean : 901.2 Mean : 7.883 Mean :0.5178 Mean :0.2072
## 3rd Qu.:1296.8 3rd Qu.: 8.900 3rd Qu.:0.6300 3rd Qu.:0.3300
## Max. :1559.0 Max. :11.800 Max. :0.7350 Max. :0.4900
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 1.700 Min. :0.04500 Min. :50.0
## 1st Qu.: 2.575 1st Qu.:0.07575 1st Qu.:51.0
## Median : 4.300 Median :0.10550 Median :52.5
## Mean : 5.950 Mean :0.12322 Mean :56.0
## 3rd Qu.: 7.600 3rd Qu.:0.16950 3rd Qu.:56.5
## Max. :15.400 Max. :0.23500 Max. :72.0
## total.sulfur.dioxide density pH sulphates
## Min. : 63.0 Min. :0.9934 Min. :3.160 Min. :0.4400
## 1st Qu.: 77.5 1st Qu.:0.9957 1st Qu.:3.200 1st Qu.:0.5300
## Median : 96.5 Median :0.9979 Median :3.290 Median :0.7200
## Mean :103.7 Mean :0.9979 Mean :3.312 Mean :0.6750
## 3rd Qu.:124.0 3rd Qu.:0.9992 3rd Qu.:3.438 3rd Qu.:0.8075
## Max. :160.0 Max. :1.0037 Max. :3.590 Max. :0.9300
## alcohol quality cAcid.cut total.sulf.dioxide.cut
## Min. : 9.000 Min. :5.000 (0.001,0.15]:3 (0,59] : 0
## 1st Qu.: 9.425 1st Qu.:5.000 (0.15,0.3] :5 (59,109] :10
## Median : 9.500 Median :5.000 (0.3,0.5] :7 (109,300]: 8
## Mean : 9.978 Mean :5.556 (0.5,1.2] :0
## 3rd Qu.:10.275 3rd Qu.:6.000 NA's :3
## Max. :12.900 Max. :7.000
##
## Pearson's product-moment correlation
##
## data: quality and as.numeric(free.cut)
## t = -1.9126, df = 1597, p-value = 0.05598
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.096599360 0.001219402
## sample estimates:
## cor
## -0.04780459
The readme hinted that there might be adverse tastes when free sulfur dioxide becomes greater than 50 ppm. There was an increase in correlation from the numerical ‘free.sulfur.dioxide’ parameter to the categorical one, although at -0.048 (p = 0.056) it remains very weak; perhaps it will still be useful in gaining insight into the quality parameter? I also turned the ‘total.sulfur.dioxide’ variable into a categorical variable and saw an increase in correlation with ‘quality’.
Chlorides and Density
## [1] "(0,0.07]" "(0.07,0.079]" "(0.079,0.09]" "(0.09,0.7]"
##
## Pearson's product-moment correlation
##
## data: quality and as.numeric(chloride.cut)
## t = -7.0399, df = 1597, p-value = 2.846e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2206370 -0.1255389
## sample estimates:
## cor
## -0.1734924
Both density and chlorides look roughly normally distributed and have reasonable correlations with quality. After resizing the x-axis, chlorides in particular appears close to normal. Turning chlorides into a categorical variable increased the magnitude of its correlation with quality from -0.12 to -0.17.
## [1] "(0,0.996]" "(0.996,0.997]" "(0.997,1.1]"
##
## Pearson's product-moment correlation
##
## data: quality and as.numeric(density.cut)
## t = -7.0216, df = 1597, p-value = 3.232e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2202076 -0.1250947
## sample estimates:
## cor
## -0.1730546
Both new categorical variables show an increased correlation with quality. Does this mean if a regression model is made, the categorical inputs will yield a better model?
Rabbit Hole Idea
# Twelve equally populated bins: breaks at 0 and at the 1/12, ..., 12/12 quantiles.
wqr$dcut <- cut(wqr$density,
breaks = c(0, quantile(wqr$density, probs = (1:12) / 12)))
levels(wqr$dcut)
## [1] "(0,0.9942]" "(0.9942,0.9951]" "(0.9951,0.9956]"
## [4] "(0.9956,0.996]" "(0.996,0.9964]" "(0.9964,0.9968]"
## [7] "(0.9968,0.9971]" "(0.9971,0.9974]" "(0.9974,0.9978]"
## [10] "(0.9978,0.9984]" "(0.9984,0.9994]" "(0.9994,1.004]"
levels(wqr$dcut)<-c("1","2","3","4","5","6","6","5","4","3","2","1")
with(wqr, cor.test(quality, as.numeric(dcut), method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: quality and as.numeric(dcut)
## t = -6.8116, df = 1597, p-value = 1.363e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2152739 -0.1199932
## sample estimates:
## cor
## -0.168026
The thought was that roughly normally distributed variables like density and pH might have more to say despite their low correlation. The idea is simple: wines with very high or very low density or pH will possibly be ‘bad’ quality wines (in other words, extremes tend toward the negative outcome). As wine makers try to create ‘good’ wines, the independent variables should tend toward the middle of the distribution. By cutting the data into 12 equally populated parts and then combining the ends (high and low pairs) together, the new categorical variable might reveal a better correlation with ‘quality’. A slight drop in correlation was found :(.
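The cut-and-fold step can be wrapped in a small helper so it does not have to be retyped for every variable. This is a sketch, not the code used above: fold_bins is a hypothetical name, the data are simulated, and it uses the minimum of the variable as the lowest break instead of 0. It also assumes the quantile breaks are unique, which holds for continuous data but can fail when many values are tied.

```r
# Cut x into n_bins equally populated bins, then pair the bins from both
# ends: 1 = lowest or highest bin, n_bins / 2 = the two middle bins.
fold_bins <- function(x, n_bins = 12) {
  stopifnot(n_bins %% 2 == 0)
  breaks <- quantile(x, probs = (0:n_bins) / n_bins, names = FALSE)
  bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  pmin(bin, n_bins + 1 - bin)
}

# Simulated stand-in for a roughly normal variable such as density.
set.seed(1)
toy_density <- rnorm(1200, mean = 0.9967, sd = 0.002)

folded <- fold_bins(toy_density)
print(table(folded))
```

A folded variable built this way is ordered from "extreme" to "typical", which is what the correlation test against quality needs.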
What I was looking for in the plots was a higher number of low quality rankings associated with category 1 of dcut (the combination of the highest and lowest density segments after cutting the samples into 12 evenly populated segments). The plots do not reveal anything significant, although they are very colorful. Perhaps pH will reveal something.
# Twelve equally populated bins: breaks at 0 and at the 1/12, ..., 12/12 quantiles.
wqr$pHcut <- cut(wqr$pH,
breaks = c(0, quantile(wqr$pH, probs = (1:12) / 12)))
levels(wqr$pHcut)
## [1] "(0,3.1]" "(3.1,3.16]" "(3.16,3.21]" "(3.21,3.25]" "(3.25,3.28]"
## [6] "(3.28,3.31]" "(3.31,3.34]" "(3.34,3.37]" "(3.37,3.4]" "(3.4,3.45]"
## [11] "(3.45,3.53]" "(3.53,4.01]"
levels(wqr$pHcut)<-c("1","2","3","4","5","6","6","5","4","3","2","1")
with(wqr, cor.test(quality, as.numeric(pHcut), method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: quality and as.numeric(pHcut)
## t = 0.422, df = 1597, p-value = 0.6731
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03848218 0.05954919
## sample estimates:
## cor
## 0.01055887
with(wqr, table(factor(quality), pHcut))
## pHcut
## 1 2 3 4 5 6
## 3 2 3 0 3 0 2
## 4 12 6 8 7 6 14
## 5 104 133 117 107 95 125
## 6 97 98 125 102 85 131
## 7 33 24 38 35 32 37
## 8 4 5 2 5 2 0
with(subset(wqr, as.numeric(wqr$dcut)>=5), table(factor(quality), pHcut))
## pHcut
## 1 2 3 4 5 6
## 3 1 0 0 0 0 1
## 4 1 3 3 3 2 6
## 5 30 50 60 48 37 53
## 6 23 24 37 27 35 50
## 7 6 5 9 11 5 11
## 8 0 3 0 0 1 0
Not looking good with pH. Maybe a combination of pH and density will reveal an unexpected correlation, where the combination may show some clustering of higher or lower quality.
The pH investigation is a dead end, though I still haven’t given up on the idea that there might be something in the normally distributed data. Maybe chlorides and residual sugar have some story to tell.
## [1] 1.964472
## [1] 3.202083
## [1] 1.848927
## [1] 2.078141
## [1] 1.881038
## [1] 1.677124
## [1] 0.004387833
## [1] 0.005805184
## [1] 0.002884486
## [1] 0.001565254
## [1] 0.0008676273
## [1] 0.0001363791
## [1] 0.02075111
## [1] 0.03292075
## [1] 0.02268592
## [1] 0.02371449
## [1] 0.02253024
## [1] 0.04025654
## [1] 4.007382e-06
## [1] 2.481157e-06
## [1] 2.523346e-06
## [1] 4.000036e-06
## [1] 4.733842e-06
## [1] 5.656195e-06
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Can people actually taste 0.1 g/liter of chloride/salt? Note: our chloride and residual sugar measurements are in g/dm^3, which is the same as g/liter. This might be a futile exercise. I looked at the variances of several variables to see if there is a trend of increasing or decreasing variance. Such a trend could indicate that the tails of the distribution are tied to one particular quality level: once again, the ‘too much or too little yields bad quality’ theory.
# Twelve equally populated bins: breaks at 0 and at the 1/12, ..., 12/12 quantiles.
wqr$chlcut <- cut(wqr$chlorides,
breaks = c(0, quantile(wqr$chlorides, probs = (1:12) / 12)))
levels(wqr$chlcut)
## [1] "(0,0.059]" "(0.059,0.066]" "(0.066,0.07]" "(0.07,0.074]"
## [5] "(0.074,0.076]" "(0.076,0.079]" "(0.079,0.082]" "(0.082,0.085]"
## [9] "(0.085,0.09]" "(0.09,0.097]" "(0.097,0.114]" "(0.114,0.611]"
levels(wqr$chlcut)<-c("1","2","3","4","5","6","6","5","4","3","2","1")
summary(wqr$chlorides, 10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
ggplot(aes(x=chlorides, y=quality, color=chlcut), data=wqr) +
geom_point(alpha = 1/2, position="jitter", size=3)
# Map color to chlcut so that the brewer scale and its legend take effect.
ggplot(aes(y=quality, x=chlcut, color=chlcut), data=wqr) +
geom_point(alpha = 1/2, position='jitter', size=3) +
ggtitle("Despair Never Looked so... Colorful Why am I still doing this") +
scale_color_brewer(type = 'seq', palette = 'Blues',
guide = guide_legend(title = 'chlcut', reverse = F,
override.aes = list(alpha = 1, size = 1)))
with(wqr, table(factor(quality), chlcut))
## chlcut
## 1 2 3 4 5 6
## 3 4 1 1 1 1 2
## 4 12 7 12 11 4 7
## 5 98 114 101 132 93 143
## 6 112 104 102 111 84 125
## 7 45 50 34 35 21 14
## 8 3 3 3 5 2 2
The Wikipedia article on “Taste” gives the average human detection threshold for sucrose as 10 millimoles per liter, which translates to about 3.4 g/liter (sucrose has a molar mass of roughly 342 g/mol). Further searching turned up studies putting the salt detection threshold at around 0.5 mol/liter, which for sodium chloride (about 58.4 g/mol) is roughly 29 g/liter. The chloride levels in these wines (0.012 to 0.611 g/liter) are therefore most likely too low for humans to readily distinguish. The picture for sugar is mixed: most wines sit below the sucrose threshold (median 2.2 g/liter, third quartile 2.6 g/liter), so only the sweeter tail (up to 15.5 g/liter) should be clearly detectable. But once again, the ‘folding’ of the distribution has not revealed any new correlations.
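Converting a molar threshold to grams per liter means multiplying by the molar mass of the substance. The sketch below does this for the two thresholds quoted above, assuming the sugar is sucrose and the salt is sodium chloride; molar_to_g_per_l is a hypothetical helper name.

```r
# Molar masses in g/mol.
sucrose_g_per_mol <- 342.30  # C12H22O11
nacl_g_per_mol    <- 58.44   # NaCl

# Concentration in mol/liter times molar mass gives g/liter.
molar_to_g_per_l <- function(mol_per_l, g_per_mol) {
  mol_per_l * g_per_mol
}

sucrose_threshold <- molar_to_g_per_l(0.010, sucrose_g_per_mol)  # about 3.42 g/liter
salt_threshold    <- molar_to_g_per_l(0.5, nacl_g_per_mol)       # about 29.2 g/liter

print(round(c(sucrose = sucrose_threshold, salt = salt_threshold), 2))
```

These values can be compared directly with the residual.sugar and chlorides columns, which are already in g/dm^3.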
Alcohol
Alcohol vs. quality (the axes are a little backwards, but it seems more natural to place quality on the x-axis for visualization purposes). Added the median to show the possible linear relationship with quality.
## [1] "(2,4]" "(4,5]" "(5,6]" "(6,7]" "(7,8]"
##
## Pearson's product-moment correlation
##
## data: as.numeric(qual.cut) and alcohol
## t = 22.0304, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4442698 0.5195032
## sample estimates:
## cor
## 0.4827767
Created a new categorical variable out of ‘quality’ called ‘qual.cut’ by merging the bottom 2 categories in ‘quality’. Hope to make trends a bit clearer and provide a cleaner view of the bins.
Volatile Acidity
The negative correlation between volatile acidity and citric acid can be seen in the histogram above: the higher the volatile acidity, the lower the citric acid. There also seems to be a reasonably strong correlation between total sulfur dioxide and quality.
Sulphates
Sulphates seem to be linearly correlated with quality, as shown by the ggpairs correlation matrix, so this is a good variable to add to a model.
Feeling somewhat unsatisfied with the current findings and the lack of elements to account for the variance in the quality ratings (although quality ratings are subjective and discrete), I started reading about how wine is rated. There are many subjective measures in rating wines, but the one that makes the most sense is Appearance (Visual), Aroma(smell), Taste, and Aftertaste (finish). Perhaps it’s best to see how good of a model we can obtain from the current inputs.
Modeling the Data
##
## Calls:
## m1: lm(formula = I(quality) ~ alcohol, data = wqr)
## m2: lm(formula = I(quality) ~ alcohol + sulphates, data = wqr)
## m3: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity,
## data = wqr)
## m4: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut, data = wqr)
## m5: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut, data = wqr)
## m6: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut, data = wqr)
## m7: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut,
## data = wqr)
## m8: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut +
## pH, data = wqr)
## m9: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut +
## pH + fixed.acidity, data = wqr)
## m10: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut +
## pH + fixed.acidity + residual.sugar, data = wqr)
## m11: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut +
## pH + fixed.acidity + residual.sugar + free.sulfur.dioxide,
## data = wqr)
##
## =======================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 1.375*** 2.611*** 2.556*** 2.665*** 2.697*** 2.750*** 4.080*** 3.732*** 3.646*** 3.704***
## (0.175) (0.177) (0.196) (0.218) (0.217) (0.230) (0.247) (0.492) (0.673) (0.676) (0.676)
## alcohol 0.361*** 0.346*** 0.309*** 0.317*** 0.301*** 0.295*** 0.290*** 0.304*** 0.300*** 0.292*** 0.295***
## (0.017) (0.016) (0.016) (0.017) (0.017) (0.018) (0.020) (0.021) (0.021) (0.022) (0.022)
## sulphates 0.994*** 0.679*** 0.630*** 0.670*** 0.705*** 0.712*** 0.686*** 0.694*** 0.710*** 0.693***
## (0.102) (0.101) (0.105) (0.104) (0.106) (0.106) (0.106) (0.107) (0.107) (0.108)
## volatile.acidity -1.221*** -1.170*** -1.067*** -1.024*** -1.012*** -1.026*** -1.031*** -1.035*** -1.009***
## (0.097) (0.123) (0.123) (0.125) (0.126) (0.126) (0.126) (0.126) (0.127)
## cAcid.cut: (0.15,0.3]/(0.001,0.15] -0.062 -0.009 -0.003 0.002 -0.030 -0.033 -0.032 -0.025
## (0.049) (0.049) (0.049) (0.050) (0.051) (0.051) (0.051) (0.051)
## cAcid.cut: (0.3,0.5]/(0.001,0.15] -0.023 0.026 0.037 0.047 -0.014 -0.026 -0.028 -0.020
## (0.052) (0.052) (0.053) (0.056) (0.059) (0.061) (0.061) (0.061)
## cAcid.cut: (0.5,1.2]/(0.001,0.15] 0.030 0.055 0.091 0.106 0.010 -0.010 -0.013 0.000
## (0.065) (0.065) (0.066) (0.071) (0.077) (0.081) (0.081) (0.082)
## total.sulf.dioxide.cut: (59,109]/(0,59] -0.112** -0.106* -0.107* -0.098* -0.092* -0.098* -0.142**
## (0.043) (0.043) (0.043) (0.043) (0.043) (0.044) (0.050)
## total.sulf.dioxide.cut: (109,300]/(0,59] -0.396*** -0.389*** -0.390*** -0.399*** -0.384*** -0.401*** -0.468***
## (0.073) (0.074) (0.074) (0.074) (0.076) (0.077) (0.085)
## chloride.cut: (0.07,0.079]/(0,0.07] 0.012 0.016 0.018 0.016 0.016 0.015
## (0.049) (0.050) (0.050) (0.050) (0.050) (0.050)
## chloride.cut: (0.079,0.09]/(0,0.07] -0.026 -0.019 -0.019 -0.021 -0.017 -0.018
## (0.051) (0.052) (0.052) (0.052) (0.052) (0.052)
## chloride.cut: (0.09,0.7]/(0,0.07] -0.104* -0.097 -0.119* -0.118* -0.120* -0.121*
## (0.053) (0.054) (0.054) (0.054) (0.054) (0.054)
## density.cut: (0.996,0.997]/(0,0.996] -0.023 -0.015 -0.027 -0.039 -0.031
## (0.049) (0.049) (0.051) (0.052) (0.052)
## density.cut: (0.997,1.1]/(0,0.996] -0.032 -0.026 -0.055 -0.087 -0.075
## (0.054) (0.053) (0.066) (0.069) (0.069)
## pH -0.431** -0.345 -0.316 -0.352
## (0.138) (0.179) (0.180) (0.181)
## fixed.acidity 0.015 0.019 0.017
## (0.020) (0.020) (0.020)
## residual.sugar 0.019 0.016
## (0.013) (0.013)
## free.sulfur.dioxide 0.004
## (0.002)
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.270 0.336 0.335 0.349 0.351 0.352 0.356 0.356 0.357 0.359
## adj. R-squared 0.226 0.269 0.335 0.332 0.345 0.347 0.346 0.350 0.350 0.350 0.351
## sigma 0.710 0.690 0.659 0.658 0.651 0.650 0.651 0.649 0.649 0.649 0.648
## F 468.267 294.988 268.912 122.348 97.582 71.693 60.624 57.329 53.529 50.371 47.674
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1675.142 -1599.384 -1463.199 -1447.445 -1444.307 -1444.121 -1439.214 -1438.925 -1437.771 -1436.117
## Deviance 805.870 760.894 692.105 631.385 617.968 615.331 615.174 611.073 610.832 609.872 608.498
## AIC 3448.114 3358.284 3208.768 2942.398 2914.889 2914.614 2918.241 2910.428 2911.850 2911.541 2910.234
## BIC 3464.245 3379.793 3235.654 2984.726 2967.799 2983.397 2997.606 2995.084 3001.797 3006.779 3010.762
## N 1599 1599 1599 1467 1467 1467 1467 1467 1467 1467 1467
## =======================================================================================================================================================
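I decided to run a linear regression to see how much of the variance in quality I can capture. The result is about 36% of the variance with all independent variables in the model (m11, R-squared 0.359). Note that m4 onward use 1467 observations instead of 1599, because wines with an NA in cAcid.cut are dropped, so m1 to m3 are not fitted on the same rows as the later models. The same model was run with the original non-categorical variables; the resulting model was about 1-2% worse. There are several considerations: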
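When one predictor contains NA values, lm() silently drops those rows, so nested models can end up fitted on different subsets. A minimal sketch of one way to keep the comparison like for like is shown below. The data are simulated and all names (toy, acid.cut, common) are hypothetical; it only illustrates restricting every model to the complete cases first.

```r
set.seed(42)
n <- 300

# Simulated stand-in data; acid values at or below 0.1 fall outside the
# breaks and become NA, mimicking a binned variable with missing values.
toy <- data.frame(
  alcohol   = rnorm(n, mean = 10.4, sd = 1),
  sulphates = rnorm(n, mean = 0.66, sd = 0.17),
  acid      = runif(n)
)
toy$quality  <- 2 + 0.3 * toy$alcohol + 0.8 * toy$sulphates + rnorm(n, sd = 0.7)
toy$acid.cut <- cut(toy$acid, breaks = c(0.1, 0.4, 0.7, 1))

# Keep only rows that every model can use.
common <- toy[complete.cases(toy), ]

# Build the nested models on the same rows.
m1 <- lm(quality ~ alcohol, data = common)
m2 <- update(m1, . ~ . + sulphates)
m3 <- update(m2, . ~ . + acid.cut)

fits      <- list(m1 = m1, m2 = m2, m3 = m3)
n_used    <- sapply(fits, nobs)
r_squared <- sapply(fits, function(m) summary(m)$r.squared)
print(rbind(n_used, r_squared = round(r_squared, 3)))
```

With every model on the same rows, R-squared can only stay level or rise as terms are added, which makes any drop between models a sign that the samples differ.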
##
## Calls:
## m1: glm(formula = factor(quality) ~ alcohol, family = binomial(link = "probit"),
## data = wqr)
## m2: glm(formula = factor(quality) ~ alcohol + sulphates, family = binomial(link = "probit"),
## data = wqr)
## m3: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity,
## family = binomial(link = "probit"), data = wqr)
## m4: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut, family = binomial(link = "probit"), data = wqr)
## m5: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut, family = binomial(link = "probit"),
## data = wqr)
## m6: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut, family = binomial(link = "probit"),
## data = wqr)
## m7: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut,
## family = binomial(link = "probit"), data = wqr)
## m8: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut +
## pH, family = binomial(link = "probit"), data = wqr)
## m9: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut +
## pH + fixed.acidity, family = binomial(link = "probit"), data = wqr)
## m10: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut +
## pH + fixed.acidity + residual.sugar, family = binomial(link = "probit"),
## data = wqr)
## m11: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity +
## cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut +
## pH + fixed.acidity + residual.sugar + free.sulfur.dioxide,
## family = binomial(link = "probit"), data = wqr)
##
## ========================================================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 0.597 -0.248 1.362 0.496 0.823 1.548 1.621 10.570 14.461 14.996 14.374
## (1.339) (1.447) (2.037) (2.491) (2.700) (2.865) (3.222) (6.242) (8.612) (8.747) (8.777)
## alcohol 0.186 0.164 0.322 0.364 0.514* 0.524 0.606* 0.711* 0.779* 0.824* 0.885*
## (0.133) (0.139) (0.205) (0.232) (0.254) (0.268) (0.308) (0.344) (0.361) (0.374) (0.401)
## sulphates 1.757 -0.082 0.073 -0.806 -0.836 -1.478 -0.985 -0.867 -0.890 -0.469
## (1.133) (0.820) (1.008) (1.244) (1.316) (1.542) (1.782) (1.879) (1.874) (2.048)
## volatile.acidity -3.057*** -2.626** -4.380*** -4.807*** -5.618*** -5.913*** -5.869** -5.778** -5.839***
## (0.673) (0.809) (1.220) (1.350) (1.665) (1.780) (1.785) (1.764) (1.762)
## cAcid.cut: (0.15,0.3]/(0.001,0.15] 4.066 4.725 4.913 4.992 4.561 4.656 4.593 4.761
## (300.397) (598.195) (565.818) (531.428) (546.492) (540.848) (543.532) (526.212)
## cAcid.cut: (0.3,0.5]/(0.001,0.15] -0.197 -0.913 -1.050 -0.737 -1.329 -1.133 -1.158 -1.026
## (0.369) (0.520) (0.562) (0.591) (0.722) (0.773) (0.774) (0.777)
## cAcid.cut: (0.5,1.2]/(0.001,0.15] -0.312 -0.758 -0.767 -0.382 -1.196 -0.862 -0.881 -0.807
## (0.518) (0.620) (0.643) (0.646) (0.851) (0.950) (0.944) (0.966)
## total.sulf.dioxide.cut: (59,109]/(0,59] 5.519 5.793 6.169 6.014 6.053 6.375 6.663
## (635.708) (609.846) (554.753) (568.242) (555.182) (528.901) (522.764)
## total.sulf.dioxide.cut: (109,300]/(0,59] 6.677 7.192 7.833 7.756 7.544 7.395 7.868
## (1068.682) (1013.094) (953.589) (936.360) (941.298) (958.041) (950.906)
## chloride.cut: (0.07,0.079]/(0,0.07] -0.673 -0.394 -0.317 -0.286 -0.245 -0.261
## (0.619) (0.632) (0.657) (0.658) (0.659) (0.670)
## chloride.cut: (0.079,0.09]/(0,0.07] -0.524 -0.013 -0.040 -0.086 -0.059 -0.122
## (0.605) (0.652) (0.664) (0.671) (0.669) (0.687)
## chloride.cut: (0.09,0.7]/(0,0.07] -0.497 -0.234 -0.238 -0.258 -0.264 -0.291
## (0.613) (0.625) (0.656) (0.658) (0.654) (0.676)
## density.cut: (0.996,0.997]/(0,0.996] 0.448 0.195 0.246 0.282 0.383
## (0.704) (0.736) (0.755) (0.755) (0.755)
## density.cut: (0.997,1.1]/(0,0.996] -0.761 -0.842 -0.598 -0.459 -0.450
## (0.637) (0.714) (0.819) (0.855) (0.859)
## pH -2.912 -3.920 -4.144 -4.106
## (1.686) (2.303) (2.377) (2.454)
## fixed.acidity -0.178 -0.197 -0.204
## (0.246) (0.244) (0.250)
## residual.sugar -0.084 -0.095
## (0.177) (0.178)
## free.sulfur.dioxide -0.023
## (0.028)
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## Aldrich-Nelson R-sq. 0.001 0.004 0.018 0.014 0.021 0.022 0.024 0.026 0.027 0.027 0.027
## McFadden R-sq. 0.019 0.047 0.246 0.231 0.348 0.364 0.413 0.448 0.454 0.456 0.463
## Cox-Snell R-sq. 0.001 0.004 0.018 0.014 0.021 0.022 0.025 0.027 0.027 0.027 0.028
## Nagelkerke R-sq. 0.020 0.049 0.253 0.236 0.355 0.371 0.420 0.455 0.461 0.463 0.471
## phi 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Likelihood-ratio 2.290 5.681 29.826 20.499 30.874 32.296 36.644 39.768 40.294 40.471 41.134
## p 0.130 0.058 0.000 0.002 0.000 0.001 0.000 0.000 0.000 0.001 0.001
## Log-likelihood -59.569 -57.873 -45.801 -34.149 -28.962 -28.251 -26.077 -24.515 -24.252 -24.163 -23.832
## Deviance 119.139 115.747 91.602 68.298 57.924 56.502 52.153 49.030 48.503 48.326 47.663
## AIC 123.139 121.747 99.602 82.298 75.924 80.502 80.153 79.030 80.503 82.326 83.663
## BIC 133.893 137.878 121.111 119.335 123.542 143.994 154.227 158.395 165.159 172.273 178.901
## N 1599 1599 1599 1467 1467 1467 1467 1467 1467 1467 1467
## ========================================================================================================================================================================================
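The generalized linear model, which treats the dependent variable ‘quality’ as a categorical variable, yields a better fit, at least in the pseudo R-squared sense: the McFadden R-sq. reaches 0.463 for m11 in the table above, which seems to indicate that the model is a good fit for the data (the non-categorical version stands at an R-sq. of .430). Two caveats apply. With family = binomial, R treats the first level of a factor response as failure and all other levels as success, so these models describe quality 3 versus everything else rather than all six quality levels. A McFadden R-sq. is also not directly comparable with the R-squared of the linear models. I still wonder if a non-linear approach might be able to model the data more effectively.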
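How glm() handles a factor response under family = binomial can be checked directly. The sketch below uses simulated data (toy and above3 are hypothetical names) and compares a probit fit on factor(quality) with a probit fit on an explicit 0/1 indicator for "quality above the lowest level".

```r
set.seed(7)
n <- 400

# Simulated stand-in data: six quality levels, unrelated to alcohol.
toy <- data.frame(alcohol = rnorm(n, mean = 10.4, sd = 1))
toy$quality <- sample(3:8, n, replace = TRUE,
                      prob = c(0.05, 0.10, 0.40, 0.30, 0.10, 0.05))

# Factor response: the first level (3) is failure, every other level success.
fit_factor <- glm(factor(quality) ~ alcohol,
                  family = binomial(link = "probit"), data = toy)

# Explicit indicator for "quality above 3".
toy$above3 <- as.integer(toy$quality > 3)
fit_binary <- glm(above3 ~ alcohol,
                  family = binomial(link = "probit"), data = toy)

same_fit <- isTRUE(all.equal(coef(fit_factor), coef(fit_binary)))
print(same_fit)
```

If all quality levels are of interest, an ordinal regression would be a closer match for the question than a binomial model.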
Sulfur Dioxide and Sulphate Exploration
## cor
## 0.6676665
## cor
## 0.05165757
## cor
## 0.667776
##
## Calls:
## modFTS: lm(formula = free.sulfur.dioxide ~ total.sulfur.dioxide + sulphates +
## residual.sugar, data = wqr)
##
## ===============================
## (Intercept) 4.230***
## (0.871)
## total.sulfur.dioxide 0.209***
## (0.006)
## sulphates 1.432
## (1.148)
## residual.sugar 0.399**
## (0.141)
## -------------------------------
## R-squared 0.449
## adj. R-squared 0.448
## sigma 7.771
## F 433.388
## p 0.000
## Log-likelihood -5545.516
## Deviance 96325.376
## AIC 11101.032
## BIC 11127.918
## N 1599
## ===============================
## (Intercept) total.sulfur.dioxide sulphates
## 4.2303610 0.2085175 1.4315258
## residual.sugar
## 0.3990285
##
## Pearson's product-moment correlation
##
## data: free.sulfur.dioxide and sulfcomb
## t = 35.8786, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6400010 0.6943446
## sample estimates:
## cor
## 0.6680626
Surprisingly, a linear combination of ‘sulphates’ and ‘total.sulfur.dioxide’ did not improve the correlation with ‘free.sulfur.dioxide’. Another dead end where hypothesis and testing yielded a negative result. Ready to move on at this point.
Bonus Observations ‘X’
A tip from a friend led me to the following investigation: is there any bias between the observation number and the quality of the wine? Perhaps the experts grew tired, and observations with a larger X could indicate scoring later in the process, assuming the original order was preserved.
Looking at the plot, it seems that, with the exception of wines of quality 3, the scores are evenly distributed over the observation numbers. The medians tend toward the 800 mark, as expected in a data set of 1599 observations. One can be reasonably assured that no bias was introduced through the ordering of the observations, if the original order was preserved.
Principal Component Analysis
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7604 1.3878 1.2452 1.1015 0.97943 0.81216 0.76406
## Proportion of Variance 0.2817 0.1751 0.1410 0.1103 0.08721 0.05996 0.05307
## Cumulative Proportion 0.2817 0.4568 0.5978 0.7081 0.79528 0.85525 0.90832
## PC8 PC9 PC10 PC11
## Standard deviation 0.65035 0.58706 0.42583 0.24405
## Proportion of Variance 0.03845 0.03133 0.01648 0.00541
## Cumulative Proportion 0.94677 0.97810 0.99459 1.00000
## Standard deviations:
## [1] 1.7604353 1.3877715 1.2452082 1.1014684 0.9794346 0.8121627 0.7640623
## [8] 0.6503512 0.5870623 0.4258323 0.2440457
##
## Rotation:
## PC1 PC2 PC3 PC4
## fixed.acidity 0.48931422 -0.110502738 0.12330157 -0.229617370
## volatile.acidity -0.23858436 0.274930480 0.44996253 0.078959783
## citric.acid 0.46363166 -0.151791356 -0.23824707 -0.079418256
## residual.sugar 0.14610715 0.272080238 -0.10128338 -0.372792562
## chlorides 0.21224658 0.148051555 0.09261383 0.666194756
## free.sulfur.dioxide -0.03615752 0.513566812 -0.42879287 -0.043537818
## total.sulfur.dioxide 0.02357485 0.569486959 -0.32241450 -0.034577115
## density 0.39535301 0.233575490 0.33887135 -0.174499758
## pH -0.43851962 0.006710793 -0.05769735 -0.003787746
## sulphates 0.24292133 -0.037553916 -0.27978615 0.550872362
## alcohol -0.11323206 -0.386180959 -0.47167322 -0.122181088
## PC5 PC6 PC7 PC8
## fixed.acidity 0.08261366 -0.10147858 0.35022736 -0.17759545
## volatile.acidity -0.21873452 -0.41144893 0.53373510 -0.07877531
## citric.acid 0.05857268 -0.06959338 -0.10549701 -0.37751558
## residual.sugar -0.73214429 -0.04915555 -0.29066341 0.29984469
## chlorides -0.24650090 -0.30433857 -0.37041337 -0.35700936
## free.sulfur.dioxide 0.15915198 0.01400021 0.11659611 -0.20478050
## total.sulfur.dioxide 0.22246456 -0.13630755 0.09366237 0.01903597
## density -0.15707671 0.39115230 0.17048116 -0.23922267
## pH -0.26752977 0.52211645 0.02513762 -0.56139075
## sulphates -0.22596222 0.38126343 0.44746911 0.37460432
## alcohol -0.35068141 -0.36164504 0.32765090 -0.21762556
## PC9 PC10 PC11
## fixed.acidity 0.194020908 0.24952314 0.639691452
## volatile.acidity -0.129110301 -0.36592473 0.002388597
## citric.acid -0.381449669 -0.62167708 -0.070910304
## residual.sugar 0.007522949 -0.09287208 0.184029964
## chlorides 0.111338666 0.21767112 0.053065322
## free.sulfur.dioxide 0.635405218 -0.24848326 -0.051420865
## total.sulfur.dioxide -0.592115893 0.37075027 0.068701598
## density 0.020718675 0.23999012 -0.567331898
## pH -0.167745886 0.01096960 0.340710903
## sulphates -0.058367062 -0.11232046 0.069555381
## alcohol 0.037603106 0.30301450 -0.314525906
Throughout this analysis, it always seemed that I might be missing some hidden connections between the variables. Then I remembered that principal component analysis is one powerful way to reveal hidden structure within a data set. It can help identify how variables work together, and maybe reduce the dimensionality of the data. Principal component 1 (PC1) contains only 28% of the total variance within the 11 possibly correlated ‘independent’ variables. As for dimensionality reduction, PC1 through PC9 capture about 98% of the variance, while PC1 through PC8 capture just under 95%. With a cutoff of 95% of variance captured, the dimensionality can therefore be reduced by only 2, from 11 to 9.
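The variance figures can be recomputed directly from the standard deviations printed above: each component’s variance is its standard deviation squared, and its share is that variance divided by the total. The following is a small Python sketch of that arithmetic (not the R code used for this report); the helper name `components_needed` is my own.

```python
# Standard deviations of the 11 principal components, copied from the output above.
sdev = [1.7604353, 1.3877715, 1.2452082, 1.1014684, 0.9794346, 0.8121627,
        0.7640623, 0.6503512, 0.5870623, 0.4258323, 0.2440457]

variances = [s ** 2 for s in sdev]
total = sum(variances)
proportions = [v / total for v in variances]

# Running total of the variance shares, PC1 first.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)


def components_needed(cumulative_props, cutoff):
    """Smallest number of leading components whose cumulative share reaches the cutoff."""
    for count, share in enumerate(cumulative_props, start=1):
        if share >= cutoff:
            return count
    return len(cumulative_props)


n_95 = components_needed(cumulative, 0.95)
print(f"PC1 share: {proportions[0]:.4f}, components for 95%: {n_95}")
```

The variances sum to roughly 11, one unit per variable, which is what one would expect if the variables were standardized before the analysis.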
Looking at the breakdown of the first few principal components: PC1 is made up mainly of the variables related to acidity and pH, and PC2 is primarily related to sulfur dioxide. PC2 is interesting because it is in keeping with the findings of the sulphate/sulfur dioxide exploration I performed above. Free and total sulfur dioxide are the major components of PC2, while sulphates has almost no weight in it (a loading of about -0.04). PC3 is the component in which alcohol has its largest weight (a loading of about -0.47).
The results of the principal component analysis reveal that there is no single dominant component. This is also in keeping with the linear regression, where it took a majority of the variables to get the best scores. I think I will use PCA much earlier in my exploratory data analysis in the future.
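One quick way to read the rotation matrix is to rank the variables in each component by the absolute value of their loading. The Python sketch below does that for PC1 to PC3; the loadings are rounded to three decimals from the rotation matrix above, and the function name `top_contributors` is my own.

```python
# Loadings for (PC1, PC2, PC3), rounded to three decimals from the rotation matrix above.
loadings = {
    "fixed.acidity":        (0.489, -0.111, 0.123),
    "volatile.acidity":     (-0.239, 0.275, 0.450),
    "citric.acid":          (0.464, -0.152, -0.238),
    "residual.sugar":       (0.146, 0.272, -0.101),
    "chlorides":            (0.212, 0.148, 0.093),
    "free.sulfur.dioxide":  (-0.036, 0.514, -0.429),
    "total.sulfur.dioxide": (0.024, 0.569, -0.322),
    "density":              (0.395, 0.234, 0.339),
    "pH":                   (-0.439, 0.007, -0.058),
    "sulphates":            (0.243, -0.038, -0.280),
    "alcohol":              (-0.113, -0.386, -0.472),
}


def top_contributors(pc_index, k=3):
    """Return the k variables with the largest absolute loading on one component."""
    ranked = sorted(loadings.items(),
                    key=lambda item: abs(item[1][pc_index]),
                    reverse=True)
    return [(name, weights[pc_index]) for name, weights in ranked[:k]]


for i in range(3):
    print(f"PC{i + 1}:", top_contributors(i))
```

The sign of a loading only gives the direction along the component, so the ranking uses absolute values.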
The graphical representation of the results of the principal component analysis allows for a quick overview of the top contributors to individual principal components, as well as the proportions of variance and the cumulative variance. The dotted lines in the PC Weights Breakdown graph make it easier to trace each line.
Line graphs tend to suggest a trend, which is not the case with PCA weights, so a bar graph is perhaps the more apt choice. I chose the line graph previously because I felt it showed the information with more clarity, but displaying all 11 PCs together is perhaps too cluttered and not altogether useful. A set of the most important PCs with their weight distributions is probably more useful.
After the investigation of the data set and numerous dead ends, the two variables that seem to have the strongest effect on red wine quality are Volatile Acidity and Alcohol.
Volatile Acidity seems to be one of the primary contributors to quality, and the box plot helps reveal the roughly linear relationship between this input variable and the dependent variable. An added observation from the visualization is that Citric Acid is negatively correlated with Volatile Acidity.
Principal Component Analysis
The graphical representation of the principal component analysis (PCA) helps to quickly visualize the distribution of the total variance in the data set. The bar graph showing the variance proportion of each PC, together with the color-coordinated line graph showing the relative contribution of each variable to the respective PCs, gives a quick way to find hidden relationships. The main relationships are acidity, sulfur dioxide, and alcohol, corresponding to PC1, PC2 and PC3. It takes nine principal components to capture more than 95% of the total variance.
The final low-hanging insight is that alcohol is linearly correlated with quality and should be a significant contributor if one were to build a predictive model around the given variables and red wine quality. It seems that, at least for red wine, higher alcohol content means a better quality score! The chart is also colored by the observation number ‘X’ to demonstrate that there is no bias in the inherent ordering of the observations (as in, observations numbered 1-200 do not tend to have higher quality ratings than observations 1400-1600).
Exploring the Red Wine data-set has been somewhat frustrating. After coming from the lessons with the Diamonds and Facebook data-sets, where intuition paid off at least two times out of three, the red wine data refused to give up anything that was not somewhat obvious. Working with a qualitative, unbalanced sampling of the dependent variable was somewhat challenging as well. I keep thinking that discarding the bottom 2 and the top levels of the quality factor would vastly improve correlations, at the risk of making the modeling trivial and stripping the data of its meaning.
The variables that affect Red Wine Quality most are ‘alcohol’, ‘volatile.acidity’ and ‘sulphates’. ‘citric.acid’, though reasonably highly correlated with quality, is also highly correlated with ‘volatile.acidity’ and loses much of its impact in the linear modeling because of that relationship. Certain variables that I had hoped would show better correlation through informed factoring did not perform as expected (‘free.sulfur.dioxide’). Instead, ‘total.sulfur.dioxide’ gave better correlation after factoring.
For the future, perhaps a non-linear model could be implemented and tested. The exploration of the red wine data-set took more time than I expected. In the future I also think I will use Principal Component Analysis earlier in the exploration to quickly see if there are hidden, or even more obvious, relationships between variables. R is truly a powerful data analysis tool.